• Data Manipulation
  • Data Visualization
  • Basic Exploratory Analysis
  • Principal Component Analysis (PCA)

To display a plot in an Rmd file, use show() and close().

This is only necessary if the HTML is generated from RStudio; it is NOT needed in Jupyter or Spyder.

import matplotlib.pyplot as plt
def open_close_plot():
    # Render the current figure, then release its resources
    plt.show()
    plt.close()

Data Manipulation

  • numpy provides array (matrix) objects in Python.
  • In Python, typing is dynamic.
  • Every type is actually a class.

**Example (C code):**

int resultado = 0;
int i;
for(i=0; i<100; i++){
   resultado += i;
}
printf("%i",resultado);

**Python code**

resultado = 0
for i in range(100):
    resultado += i
print(resultado)
4950

**Note:** In C, resultado is a plain integer; in Python it is an object of the integer class.

**Another example (Python code)**

x = 4
x = "cuatro"

**C code**

int x = 4;
x = "cuatro";  // FAILS: incompatible types

Another example is Python lists:

  • They are much more than plain lists or arrays
  • They are polymorphic lists of objects
L3 = [True, "2", 3.0, 4]
print([type(i) for i in L3])
[<class 'bool'>, <class 'str'>, <class 'float'>, <class 'int'>]

Important: sometimes, for efficiency, it is worth giving the data an explicit type.
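When a list mixes types, numpy coerces all elements to a common type instead of keeping a polymorphic list; a minimal sketch:

```python
import numpy as np

L3 = [True, "2", 3.0, 4]
arr = np.array(L3)  # numpy finds a common type for all elements
print(arr.dtype)    # everything was coerced to strings
print(arr)
```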

import numpy as np

**Integer array (default type)**

print(np.array([1, 4, 2, 5, 3]))
[1 4 2 5 3]

**Float array**

print(np.array([1, 2, 3, 4], dtype='float32'))
[1. 2. 3. 4.]

**More examples of numpy arrays**

print(np.array([range(i, i + 3) for i in [2, 4, 6]]))
[[2 3 4]
 [4 5 6]
 [6 7 8]]
print(np.zeros(10, dtype=int))
[0 0 0 0 0 0 0 0 0 0]
print(np.ones((3, 5), dtype=float))
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
print(np.full((3, 5), 3.14))
[[3.14 3.14 3.14 3.14 3.14]
 [3.14 3.14 3.14 3.14 3.14]
 [3.14 3.14 3.14 3.14 3.14]]
print(np.arange(0, 20, 2))
[ 0  2  4  6  8 10 12 14 16 18]
print(np.random.random((3, 3)))
[[0.52869319 0.44657109 0.43302533]
 [0.56920995 0.52450806 0.66529526]
 [0.48084003 0.77142621 0.68855835]]
print(np.random.normal(0, 1, (3, 3)))
[[ 0.192064    1.36762902 -0.73025102]
 [ 0.62324648 -0.02916181  0.52112409]
 [-0.76949374  0.72094898 -0.50146181]]
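Two more constructors worth knowing, shown here as a brief sketch: np.linspace for evenly spaced values and np.eye for an identity matrix.

```python
import numpy as np

print(np.linspace(0, 1, 5))  # 5 evenly spaced values between 0 and 1
print(np.eye(3))             # 3x3 identity matrix
```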

**Operations on an Array**

x = np.arange(4)
print("x     =", x)
x     = [0 1 2 3]
print("x + 5 =", x + 5)
x + 5 = [5 6 7 8]
print("x - 5 =", x - 5)
x - 5 = [-5 -4 -3 -2]
print("x * 2 =", x * 2)
x * 2 = [0 2 4 6]
print("x / 2 =", x / 2)
x / 2 = [0.  0.5 1.  1.5]
print("x // 2 =", x // 2)  # integer division
x // 2 = [0 0 1 1]
x = np.array([-2, -1, 0, 1, 2])
print(abs(x))
[2 1 0 1 2]
x = [1, 2, 3]
print("x     =", x)
x     = [1, 2, 3]
print("e^x   =", np.exp(x))
e^x   = [ 2.71828183  7.3890561  20.08553692]
print("2^x   =", np.exp2(x))
2^x   = [2. 4. 8.]
print("3^x   =", np.power(3, x))
3^x   = [ 3  9 27]
x = [1, 2, 4, 10]
print("x        =", x)
x        = [1, 2, 4, 10]
print("ln(x)    =", np.log(x))
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
print("log2(x)  =", np.log2(x))
log2(x)  = [0.         1.         2.         3.32192809]
print("log10(x) =", np.log10(x))
log10(x) = [0.         0.30103    0.60205999 1.        ]
L = np.random.random(10000000)
print(sum(L))
5000335.350233283
print(np.sum(L)) # Runs much faster
5000335.350233365
print(min(L))
9.239417675388495e-08
print(np.min(L)) # Runs much faster
9.239417675388495e-08
print(max(L))
0.9999999399907006
print(np.max(L)) # Runs much faster
0.9999999399907006
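The speed difference can be measured with timeit; a minimal sketch (the exact numbers depend on the machine, only the ordering matters):

```python
import timeit
import numpy as np

L = np.random.random(1_000_000)
t_py = timeit.timeit(lambda: sum(L), number=3)     # Python's built-in sum
t_np = timeit.timeit(lambda: np.sum(L), number=3)  # numpy's vectorized sum
print(t_py, t_np)  # np.sum is typically orders of magnitude faster
```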
M = np.random.random((3, 4))
print(M)
[[0.55105522 0.94320691 0.36861327 0.91139172]
 [0.68438111 0.17027191 0.29532868 0.91794155]
 [0.53047182 0.81232561 0.08428512 0.27012474]]
print(M.sum()) # Calling a numpy method
6.539397659005901
print(M.min(axis=0)) # Per column
[0.53047182 0.17027191 0.08428512 0.27012474]
print(M.min(axis=1)) # Per row
[0.36861327 0.17027191 0.08428512]
datos = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173,
                  175, 178, 183, 193, 178, 173, 174, 183, 183, 168, 170, 178,
                  182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183,
                  177, 185, 188, 188, 182, 185])
print(datos)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
 177 185 188 188 182 185]
print("Mean:               ", datos.mean())
Mean:                179.73809523809524
print("Standard deviation: ", datos.std())
Standard deviation:  6.931843442745892
print("Minimum:            ", datos.min())
Minimum:             163
print("Maximum:            ", datos.max())
Maximum:             193
print("25th percentile:    ", np.percentile(datos, 25))
25th percentile:     174.25
print("Median:             ", np.median(datos))
Median:              182.0
print("75th percentile:    ", np.percentile(datos, 75))
75th percentile:     183.0

**Centering an array**

X = np.random.random((10, 3))
print(X)
[[0.75091584 0.64689066 0.81832308]
 [0.99632631 0.75363456 0.4247169 ]
 [0.53132617 0.03999974 0.20514494]
 [0.23978885 0.04311217 0.82787171]
 [0.35406917 0.25734554 0.48299642]
 [0.91718336 0.69447619 0.78475718]
 [0.18264308 0.77555032 0.45953422]
 [0.93146611 0.09848661 0.37598232]
 [0.99479358 0.38388373 0.36092121]
 [0.40944035 0.67059264 0.06172486]]
Xmedia = X.mean(0)
print(Xmedia)
[0.63079528 0.43639722 0.48019728]
X_centrado = X - Xmedia
print(X_centrado)
[[ 0.12012055  0.21049345  0.33812579]
 [ 0.36553103  0.31723734 -0.05548039]
 [-0.09946911 -0.39639748 -0.27505235]
 [-0.39100643 -0.39328505  0.34767443]
 [-0.27672611 -0.17905168  0.00279914]
 [ 0.28638808  0.25807898  0.3045599 ]
 [-0.44815221  0.3391531  -0.02066306]
 [ 0.30067083 -0.33791061 -0.10421496]
 [ 0.3639983  -0.05251349 -0.11927607]
 [-0.22135493  0.23419542 -0.41847242]]
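To check that the centering worked, the column means of the centered array should be numerically zero; a quick sketch:

```python
import numpy as np

X = np.random.random((10, 3))
X_centrado = X - X.mean(0)
print(X_centrado.mean(0))  # each entry is ~0 up to floating-point error
```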

**Array operations similar to R's**

x = np.array([1, 2, 3, 4, 5])
print(x < 3)
[ True  True False False False]
print(np.less(x,3)) # The < operator calls the np.less ufunc internally
[ True  True False False False]
print(x > 3)
[False False False  True  True]
print(x <= 3)
[ True  True  True False False]
print(x >= 3)
[False False  True  True  True]
print(x != 3)
[ True  True False  True  True]
print(x == 3)
[False False  True False False]
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x)
[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]
print(x <= 3)
[[False  True  True  True]
 [False False  True False]
 [ True False False False]]
print(np.count_nonzero(x < 6))
8

**& = AND** and **| = OR**

print(np.sum(x < 6)) # Equivalent to the previous one: False=0 and True=1
8
print(np.sum(x < 6, axis=1))  # Per row
[4 2 2]
print(np.sum((x > 4) & (x < 6)))
2
print(np.sum(~(x > 4) & (x < 6)))
6
print(np.sum((x > 4) | (x < 6)))
12
# Are there any values greater than 8?
print(np.any(x > 8))
True
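np.all is the counterpart of np.any: it asks whether all values satisfy the condition; a minimal sketch using the same matrix:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
print(np.all(x < 10))         # are all values below 10?
print(np.all(x > 0, axis=1))  # per row: row 0 contains a 0
```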

**Similar to R**

print(x)
[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]
print(x[x < 4])
[0 3 3 3 2]
print(np.sum(x[x < 4]))
11

**OR operator**

A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
print(A | B)
[ True  True  True False  True  True]

**Index arrays: “fancy indexing”**

rand = np.random.RandomState(42) # Fixes the random seed
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
print([x[3], x[7], x[2]])
[71, 86, 14]

**Equivalent, but passing the indices as an array**

ind = [3, 7, 4]
print(x[ind])
[71 86 60]
ind = np.array([[3, 7],
                [4, 5]])
print(x[ind])
[[71 86]
 [60 20]]

Point-selection example

**Selecting random points**

mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
print(X)
[[-0.644508   -0.46220608]
 [ 0.7376352   1.21236921]
 [ 0.88151763  1.12795177]
 [ 2.04998983  5.97778598]
 [-0.1711348  -2.06258746]
 [ 0.67956979  0.83705124]
 [ 1.46860232  1.22961093]
 [ 0.35282131  1.49875397]
 [-2.51552505 -5.64629995]
 [ 0.0843329  -0.3543059 ]
 [ 0.19199272  1.48901291]
 [-0.02566217 -0.74987887]
 [ 1.00569227  2.25287315]
 [ 0.49514263  1.18939673]
 [ 0.0629872   0.57349278]
 [ 0.75093031  2.99487004]
 [-3.0236127  -6.00766046]
 [-0.53943081 -0.3478899 ]
 [ 1.53817376  1.99973464]
 [-0.50886808 -1.81099656]
 [ 1.58115602  2.86410319]
 [ 0.99305043  2.54294059]
 [-0.87753796 -1.15767204]
 [-1.11518048 -1.87508012]
 [ 0.4299908   0.36324254]
 [ 0.97253528  3.53815717]
 [ 0.32124996  0.33137032]
 [-0.74618649 -2.77366681]
 [-0.88473953 -1.81495444]
 [ 0.98783862  2.30280401]
 [-1.2033623  -2.04402725]
 [-1.51101746 -3.2818741 ]
 [-2.76337717 -7.66760648]
 [ 0.39158553  0.87949228]
 [ 0.91181024  3.32968944]
 [-0.84202629 -2.01226547]
 [ 1.06586877  0.95500019]
 [ 0.44457363  1.87828298]
 [ 0.35936721  0.40554974]
 [-0.90649669 -0.93486441]
 [-0.35790389 -0.52363012]
 [-1.33461668 -3.03203218]
 [ 0.02815138  0.79654924]
 [ 0.37785618  0.51409383]
 [-1.06505097 -2.88726779]
 [ 2.32083881  5.97698647]
 [ 0.47605744  0.83634485]
 [-0.35490984 -1.03657119]
 [ 0.57532883 -0.79997124]
 [ 0.33399913  2.32597923]
 [ 0.6575612  -0.22389518]
 [ 1.3707365   2.2348831 ]
 [ 0.07099548 -0.29685467]
 [ 0.6074983   1.47089233]
 [-0.34226126 -1.10666237]
 [ 0.69226246  1.21504303]
 [-0.31112937 -0.75912097]
 [-0.26888327 -1.89366817]
 [ 0.42044896  1.85189522]
 [ 0.21115245  2.00781492]
 [-1.83106042 -2.91352836]
 [ 0.7841796   1.97640753]
 [ 0.10259314  1.24690575]
 [-1.91100558 -3.66800923]
 [ 0.13143756 -0.07833855]
 [-0.1317045  -1.64159158]
 [-0.14547282 -1.34125678]
 [-0.51172373 -1.40960773]
 [ 0.69758045  0.72563649]
 [ 0.11677083  0.88385162]
 [-1.16586444 -2.24482237]
 [-2.23176235 -2.63958101]
 [ 0.37857234  0.69112594]
 [ 0.87475323  3.400675  ]
 [-0.86864365 -3.03568353]
 [-1.03637857 -1.18469125]
 [-0.53334959 -0.37039911]
 [ 0.30414557 -0.5828419 ]
 [-1.47656656 -2.13046298]
 [-0.31332021 -1.7895623 ]
 [ 1.12659538  1.49627535]
 [-1.19675798 -1.51633442]
 [-0.75210154 -0.79770535]
 [ 0.74577693  1.95834451]
 [ 1.56094354  2.9330816 ]
 [-0.72009966 -1.99780959]
 [-1.32319163 -2.61218347]
 [-2.56215914 -6.08410838]
 [ 1.31256297  3.13143269]
 [ 0.51575983  2.30284639]
 [ 0.01374713 -0.11539344]
 [-0.16863279  0.39422355]
 [ 0.12065651  1.13236323]
 [-0.83504984 -2.38632016]
 [ 1.05185885  1.98418223]
 [-0.69144553 -1.56919875]
 [-1.2567603  -1.125898  ]
 [ 0.09619333 -0.64335574]
 [-0.99658689 -2.35038099]
 [-1.21405259 -1.77693724]]

**Matrix dimensions**

print(X.shape)
(100, 2)

**Modifying values with index arrays (“fancy indexing”)**

x = np.arange(10)
print(x)
[0 1 2 3 4 5 6 7 8 9]
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
[ 0 99 99  3 99  5  6  7 99  9]

**Sorting vectors with numpy**

x = np.array([9, 1, -4, 23, 5])
np.sort(x)

**Equivalently, invoking the method (object-oriented style), since x is a numpy object; note that x.sort() sorts in place, while np.sort(x) returns a sorted copy**

x.sort()
print(x)
[-4  1  5  9 23]

**To obtain the indices that sort the vector**

x = np.array([9, 1, -4, 23, 5])
i = np.argsort(x)
print(i)
[2 1 4 0 3]
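Indexing the original vector with these indices yields the sorted vector; a quick check:

```python
import numpy as np

x = np.array([9, 1, -4, 23, 5])
i = np.argsort(x)
print(x[i])  # same result as np.sort(x)
```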

**Sorting an array by rows and columns**

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]
print(np.sort(X, axis=0)) # Sorts each column
[[2 1 4 0 1 5]
 [5 2 5 4 3 7]
 [6 3 7 4 6 7]
 [7 6 7 4 9 9]]
print(np.sort(X, axis=1)) # Sorts each row
[[3 4 6 6 7 9]
 [2 3 4 6 7 7]
 [1 2 4 5 7 7]
 [0 1 4 5 5 9]]

Data Manipulation in Pandas

Pandas provides DataFrame objects in Python, that is, a matrix with row and column names, plus a bit more, such as time series support.

**To see which version of Pandas you have**

import pandas as pd
print(pd.__version__)
0.23.0

DataFrames in Pandas

poblacion = {'California': 38332521,
                  'Texas': 26448193,
               'New York': 19651127,
                'Florida': 19552860,
               'Illinois': 12882135}
area = {'California': 423967, 'Texas': 695662, 'New York': 141297,
           'Florida': 170312, 'Illinois': 149995}
estados = pd.DataFrame({'Poblacion': poblacion,'Area': area})
print(estados)
            Poblacion    Area
California   38332521  423967
Florida      19552860  170312
Illinois     12882135  149995
New York     19651127  141297
Texas        26448193  695662
print(estados.index)   # Note that these are objects
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
print(estados.columns)
Index(['Poblacion', 'Area'], dtype='object')
print(estados['Area'])
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: Area, dtype: int64

**More examples**

datos = [{'a': i, 'b': 2 * i} for i in range(30)]
print(pd.DataFrame(datos))
     a   b
0    0   0
1    1   2
2    2   4
3    3   6
4    4   8
5    5  10
6    6  12
7    7  14
8    8  16
9    9  18
10  10  20
11  11  22
12  12  24
13  13  26
14  14  28
15  15  30
16  16  32
17  17  34
18  18  36
19  19  38
20  20  40
21  21  42
22  22  44
23  23  46
24  24  48
25  25  50
26  26  52
27  27  54
28  28  56
29  29  58
print(pd.DataFrame(np.random.rand(3, 2),
             columns=['C1', 'C2'],
             index=['a', 'b', 'c']))
         C1        C2
a  0.459824  0.192620
b  0.705006  0.059341
c  0.631913  0.533214

**Null values in Python: “NaN”**

print(pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]))
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0

**Explicit and implicit indices. Indexers: loc, iloc**

datos = pd.Series(['a', 'b', 'c'], index=[10, 30, 50])

**Note:** The explicit indices are 10, 30 and 50, while the implicit indices are 0, 1 and 2.

print(datos)
## 10    a
## 30    b
## 50    c
## dtype: object
print(datos[10]) # Explicit index
## a
print(datos[1])  # Raises an error
## KeyError: 1
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 766, in __getitem__
##     result = self.index.get_value(self, key)
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3103, in get_value
##     tz=getattr(series.dtype, 'tz', None))
##   File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
##   File "pandas/_libs/index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
##   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
##   File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
##   File "pandas/_libs/hashtable_class_helper.pxi", line 964, in pandas._libs.hashtable.Int64HashTable.get_item
print(datos[0:2]) # Implicit indices; this is odd and confusing
## 10    a
## 30    b
## dtype: object
print(datos[1:3]) # Implicit indices; this is odd and confusing
## 30    b
## 50    c
## dtype: object

**The above is very confusing; loc always refers to the explicit index**

print(datos.loc[10])
## a
datos.loc[1] # Raises an error
## KeyError: 'the label [1] is not in the [index]'
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1478, in __getitem__
##     return self._getitem_axis(maybe_callable, axis=axis)
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1911, in _getitem_axis
##     self._validate_key(key, axis)
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1798, in _validate_key
##     error()
##   File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
##     axis=self.obj._get_axis_name(axis)))
print(datos.loc[10:30])
## 10    a
## 30    b
## dtype: object
print(datos.loc[30:50])
## 30    b
## 50    c
## dtype: object
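iloc is the complementary indexer: it always uses the implicit (positional) index, with Python's usual end-exclusive slicing. A minimal sketch:

```python
import pandas as pd

datos = pd.Series(['a', 'b', 'c'], index=[10, 30, 50])
print(datos.iloc[1])    # position 1, regardless of the explicit labels
print(datos.iloc[0:2])  # positions 0 and 1; the end is excluded
```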

**Selecting data in a DataFrame**

poblacion = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
area = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
estados = pd.DataFrame({'Poblacion': poblacion,'Area': area})
print(estados)
            Poblacion    Area
California   38332521  423967
Florida      19552860  170312
Illinois     12882135  149995
New York     19651127  141297
Texas        26448193  695662
print(estados['Poblacion'])
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
Name: Poblacion, dtype: int64
print(estados.Poblacion)
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
Name: Poblacion, dtype: int64

**Adding a variable**

estados['Densidad'] = estados['Poblacion'] / estados['Area']
print(estados)
            Poblacion    Area    Densidad
California   38332521  423967   90.413926
Florida      19552860  170312  114.806121
Illinois     12882135  149995   85.883763
New York     19651127  141297  139.076746
Texas        26448193  695662   38.018740

**Transposing the DataFrame**

ET = estados.T
print(ET)
             California       Florida      ...           New York         Texas
Poblacion  3.833252e+07  1.955286e+07      ...       1.965113e+07  2.644819e+07
Area       4.239670e+05  1.703120e+05      ...       1.412970e+05  6.956620e+05
Densidad   9.041393e+01  1.148061e+02      ...       1.390767e+02  3.801874e+01

[3 rows x 5 columns]

**iloc and loc can also be used on a DataFrame**

estados
print(estados.iloc[:3, :2]) # Implicit indices
            Poblacion    Area
California   38332521  423967
Florida      19552860  170312
Illinois     12882135  149995
print(estados.loc[:'Illinois', :'Poblacion']) # Explicit indices
            Poblacion
California   38332521
Florida      19552860
Illinois     12882135

**More examples**

print(estados.loc[estados.Densidad > 100, ['Poblacion', 'Densidad']])
          Poblacion    Densidad
Florida    19552860  114.806121
New York   19651127  139.076746
print(estados['Florida':'Illinois'])
          Poblacion    Area    Densidad
Florida    19552860  170312  114.806121
Illinois   12882135  149995   85.883763
print(estados[1:3])
          Poblacion    Area    Densidad
Florida    19552860  170312  114.806121
Illinois   12882135  149995   85.883763
print(estados[estados.Densidad > 100])
          Poblacion    Area    Densidad
Florida    19552860  170312  114.806121
New York   19651127  141297  139.076746

**Modifying a value**

estados.iloc[0, 2] = 0
print(estados)
            Poblacion    Area    Densidad
California   38332521  423967    0.000000
Florida      19552860  170312  114.806121
Illinois     12882135  149995   85.883763
New York     19651127  141297  139.076746
Texas        26448193  695662   38.018740

Operations on DataFrames

Pandas inherits all of numpy's functionality, since Pandas is built on top of numpy.

import pandas as pd
import numpy as np
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B  C  D
0  8  8  1  6
1  7  7  8  1
2  5  9  8  9

**Applies sine to the whole DataFrame**

np.sin(df * np.pi / 4)
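Unlike raw numpy arrays, pandas operations align on the index, filling non-matching labels with NaN; a minimal sketch:

```python
import pandas as pd

s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 10, 'c': 20})
print(s1 + s2)  # 'b' adds up; 'a' and 'c' have no partner and become NaN
```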

Missing Data in Pandas

“None” is Python's own marker for missing data, but it only works with data whose dtype inherits from the “object” class.

valores1 = np.array([1, None, 3, 4])
print(valores1) # Note that dtype=object
[1 None 3 4]
import numpy as np
valores1 = np.array([1, None, 3, 4],dtype=int) # The None causes an error
## TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
valores1.sum() # The None causes an error
## TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
##   File "/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _sum
##     return umr_sum(a, axis, dtype, out, keepdims)

**Using NaN (acronym for “Not a Number”)**

valores2 = np.array([1, np.nan, 3, 24])
print(valores2)
[ 1. nan  3. 24.]
print(valores2.sum())
nan

**No error is raised; the result is nan, because:**

print(1 + np.nan) # Gives nan
nan
valores2.dtype  # Now it is float
print(valores2.min())
nan
print(valores2.max())
nan

**Everything gives nan; the solution is to use:**

print(np.nansum(valores2))
28.0
print(np.nanmin(valores2))
1.0
print(np.nanmax(valores2))
24.0

NaN and None in Pandas: they are basically interchangeable; both are converted to NaN.

datos = pd.Series([1, np.nan, 2, None, 90, -10, 76])
print(datos)
0     1.0
1     NaN
2     2.0
3     NaN
4    90.0
5   -10.0
6    76.0
dtype: float64
print(datos.isnull()) # To detect them
0    False
1     True
2    False
3     True
4    False
5    False
6    False
dtype: bool
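notnull() is the inverse mask, and on a Series dropna() simply drops the missing entries; a brief sketch (using a small Series named s to illustrate):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])
print(s[s.notnull()])  # boolean mask keeps only the valid entries
print(s.dropna())      # same result
```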

**Removing nan values: the whole row or the whole column must be dropped**

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6],
                   [-1, 94,      0]])
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.dropna()) # By default it drops rows
     0     1  2
1  2.0   3.0  5
3 -1.0  94.0  0
print(df.dropna(axis='columns'))
   2
0  2
1  5
2  6
3  0
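dropna also accepts how and thresh parameters to be less aggressive; a minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6],
                   [-1,     94,     0]])
print(df.dropna(how='all'))  # drops only rows where ALL values are NaN
print(df.dropna(thresh=3))   # keeps rows with at least 3 non-NaN values
```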

Imputing data

**Example with a data frame**

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6],
                   [-1, 94,      0]])
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.fillna(0)) # Fills (imputes) with zeros
     0     1  2
0  1.0   0.0  2
1  2.0   3.0  5
2  0.0   4.0  6
3 -1.0  94.0  0
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.fillna(method='bfill',axis=1)) # Fills with the next value along the row
     0     1    2
0  1.0   2.0  2.0
1  2.0   3.0  5.0
2  4.0   4.0  6.0
3 -1.0  94.0  0.0
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.fillna(method='bfill',axis=0)) # Fills with the next value down the column
     0     1  2
0  1.0   3.0  2
1  2.0   3.0  5
2 -1.0   4.0  6
3 -1.0  94.0  0
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.fillna(method='ffill',axis=1)) # Fills with the previous value along the row
     0     1    2
0  1.0   1.0  2.0
1  2.0   3.0  5.0
2  NaN   4.0  6.0
3 -1.0  94.0  0.0
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(df.fillna(method='ffill',axis=0)) # Fills with the previous value down the column
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  2.0   4.0  6
3 -1.0  94.0  0
print(df)
     0     1  2
0  1.0   NaN  2
1  2.0   3.0  5
2  NaN   4.0  6
3 -1.0  94.0  0
print(np.mean(df))
0     0.666667
1    33.666667
2     3.250000
dtype: float64
print(df.fillna(np.mean(df))) # Fills with the per-column mean
          0          1  2
0  1.000000  33.666667  2
1  2.000000   3.000000  5
2  0.666667   4.000000  6
3 -1.000000  94.000000  0

Combining Datasets: Concat and Append

**In numpy**

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
print(np.concatenate([x, y, z]))
[1 2 3 4 5 6 7 8 9]
x = [[1, 2],[3, 4]]
y = [[-1, -2],[-3, -4]]
print(np.concatenate([x, y], axis=1))
[[ 1  2 -1 -2]
 [ 3  4 -3 -4]]

**In pandas**

**Example: DataFrames by rows**

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
print(df1)
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},                      
                    index=[4, 5, 6, 7])
print(df2)
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])
print(df3)
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
resultado = pd.concat([df1, df2, df3])
print(resultado)
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

**Example: DataFrames by columns**

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
print(df1)
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
df2 = pd.DataFrame({'E': ['A4', 'A5', 'A6', 'A7'],
                    'F': ['B4', 'B5', 'B6', 'B7'],
                    'G': ['C4', 'C5', 'C6', 'C7'],
                    'H': ['D4', 'D5', 'D6', 'D7']},                      
                    index=[0, 1, 2, 3])
print(df2)
    E   F   G   H
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
df3 = pd.DataFrame({'I': ['A8', 'A9', 'A10', 'A11'],
                    'J': ['B8', 'B9', 'B10', 'B11'],
                    'K': ['C8', 'C9', 'C10', 'C11'],
                    'L': ['D8', 'D9', 'D10', 'D11']},
                    index=[0, 1, 2, 3])
print(df3)
     I    J    K    L
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11
resultado = pd.concat([df1, df2, df3],axis=1)
print(resultado)
    A   B   C   D   E   F   G   H    I    J    K    L
0  A0  B0  C0  D0  A4  B4  C4  D4   A8   B8   C8   D8
1  A1  B1  C1  D1  A5  B5  C5  D5   A9   B9   C9   D9
2  A2  B2  C2  D2  A6  B6  C6  D6  A10  B10  C10  D10
3  A3  B3  C3  D3  A7  B7  C7  D7  A11  B11  C11  D11

**Note:** What happens if the columns are NOT all the same?

**Example: DataFrames by rows**

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
print(df1)
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
df2 = pd.DataFrame({'C': ['A4', 'A5', 'A6', 'A7'],
                    'D': ['B4', 'B5', 'B6', 'B7'],
                    'E': ['C4', 'C5', 'C6', 'C7'],
                    'F': ['D4', 'D5', 'D6', 'D7']},                      
                    index=[4, 5, 6, 7])
print(df2)
    C   D   E   F
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
resultado = pd.concat([df1, df2])
/anaconda3/bin/python3.6:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False
print(resultado)
     A    B   C   D    E    F
0   A0   B0  C0  D0  NaN  NaN
1   A1   B1  C1  D1  NaN  NaN
2   A2   B2  C2  D2  NaN  NaN
3   A3   B3  C3  D3  NaN  NaN
4  NaN  NaN  A4  B4   C4   D4
5  NaN  NaN  A5  B5   C5   D5
6  NaN  NaN  A6  B6   C6   D6
7  NaN  NaN  A7  B7   C7   D7
resultado = pd.concat([df1, df2],join='inner')
print(resultado)
    C   D
0  C0  D0
1  C1  D1
2  C2  D2
3  C3  D3
4  A4  B4
5  A5  B5
6  A6  B6
7  A7  B7
resultado = pd.concat([df1, df2],join='outer') # Default is outer
print(resultado)
     A    B   C   D    E    F
0   A0   B0  C0  D0  NaN  NaN
1   A1   B1  C1  D1  NaN  NaN
2   A2   B2  C2  D2  NaN  NaN
3   A3   B3  C3  D3  NaN  NaN
4  NaN  NaN  A4  B4   C4   D4
5  NaN  NaN  A5  B5   C5   D5
6  NaN  NaN  A6  B6   C6   D6
7  NaN  NaN  A7  B7   C7   D7
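Regarding the Append part of the title: row-wise stacking can discard the original row labels with ignore_index=True; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
res = pd.concat([df1, df2], ignore_index=True)  # rebuilds the index as 0..3
print(res)
```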

Data Visualization

import matplotlib.pyplot as plt
import numpy as np

**Classic style**

plt.style.use('classic')

**Line plot examples (plotting functions)**

# X-axis data for the following plots
x = np.linspace(0, 10, 100)
print(x)
[ 0.          0.1010101   0.2020202   0.3030303   0.4040404   0.50505051
  0.60606061  0.70707071  0.80808081  0.90909091  1.01010101  1.11111111
  1.21212121  1.31313131  1.41414141  1.51515152  1.61616162  1.71717172
  1.81818182  1.91919192  2.02020202  2.12121212  2.22222222  2.32323232
  2.42424242  2.52525253  2.62626263  2.72727273  2.82828283  2.92929293
  3.03030303  3.13131313  3.23232323  3.33333333  3.43434343  3.53535354
  3.63636364  3.73737374  3.83838384  3.93939394  4.04040404  4.14141414
  4.24242424  4.34343434  4.44444444  4.54545455  4.64646465  4.74747475
  4.84848485  4.94949495  5.05050505  5.15151515  5.25252525  5.35353535
  5.45454545  5.55555556  5.65656566  5.75757576  5.85858586  5.95959596
  6.06060606  6.16161616  6.26262626  6.36363636  6.46464646  6.56565657
  6.66666667  6.76767677  6.86868687  6.96969697  7.07070707  7.17171717
  7.27272727  7.37373737  7.47474747  7.57575758  7.67676768  7.77777778
  7.87878788  7.97979798  8.08080808  8.18181818  8.28282828  8.38383838
  8.48484848  8.58585859  8.68686869  8.78787879  8.88888889  8.98989899
  9.09090909  9.19191919  9.29292929  9.39393939  9.49494949  9.5959596
  9.6969697   9.7979798   9.8989899  10.        ]

**Note:** The following is executed all at once

plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
open_close_plot() # NOT needed in Jupyter or Spyder, only in RStudio

Plot panels

plt.figure()  # creates the figure
# Creates the first panel
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# Creates the second panel
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))
open_close_plot()

**An object-oriented style for more complex situations**

fig, ax = plt.subplots(2)
# Calls the plot() method
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
open_close_plot()

**Another object-oriented example**

fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x))
open_close_plot()

**Functional style**

x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x))
open_close_plot()

Using colors

plt.plot(x, np.sin(x - 0), color='blue')        # Color name
plt.plot(x, np.sin(x - 1), color='g')           # Color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hexadecimal code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple between 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse')  # HTML color names
open_close_plot()

plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted')
open_close_plot()
# The same, but using style codes

plt.plot(x, x + 4, linestyle='-')  
plt.plot(x, x + 5, linestyle='--') 
plt.plot(x, x + 6, linestyle='-.') 
plt.plot(x, x + 7, linestyle=':') 
open_close_plot()

**Changing the axis limits**

plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
open_close_plot()

**Another example**

plt.plot(x, np.sin(x))
plt.axis('tight')
open_close_plot()

Titles

plt.plot(x, np.sin(x))
plt.title("Función Seno(x)")
plt.xlabel("x")
plt.ylabel("Seno(x)") 
open_close_plot()

Legends

plt.plot(x, np.sin(x), '-g', label='Seno(x)')
plt.plot(x, np.cos(x), ':b', label='Coseno(x)')
plt.axis('equal')
plt.legend()
open_close_plot()

**Object-oriented style**

ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
       xlabel='x', ylabel='Seno(x)',
       title='Un ploteo de Seno(x)')
open_close_plot()

**A la izquierda como función de matplotlib, a la derecha como método del objeto ax**

plt.xlabel() → ax.set_xlabel()
plt.ylabel() → ax.set_ylabel()
plt.xlim() → ax.set_xlim()
plt.ylim() → ax.set_ylim()
plt.title() → ax.set_title()
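La correspondencia anterior puede comprobarse con los «getters» del objeto Axes. Un bosquejo mínimo (el backend 'Agg' se usa aquí solo como suposición para poder ejecutarlo sin abrir ventanas):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # backend sin ventana, solo para este ejemplo
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Estilo orientado a objetos: equivalente a plt.xlim/plt.ylim/plt.title
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(0, 10)
ax.set_ylim(-2, 2)
ax.set_title('Seno(x)')

# Los "getters" permiten verificar la configuración aplicada
limites_x = ax.get_xlim()
titulo = ax.get_title()
plt.close(fig)
```

Como se vio antes, `ax.set(...)` permite fijar varias de estas propiedades en una sola llamada.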

Gráficos de dispersión (scatter plot) = Ejes XY

**Ejemplo**

x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black')
open_close_plot()

**Ejemplo**

rng = np.random.RandomState(0)
for marca in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marca,
             label="marca='{0}'".format(marca))
plt.legend(numpoints=1)
plt.xlim(0, 1.8)
open_close_plot()

**Ejemplo**

plt.plot(x, y, '-ok')
open_close_plot()

**Ejemplo**

plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2)
open_close_plot()

**El comando scatter, más potente**

**Ejemplo**

plt.scatter(x, y, marker='o')
open_close_plot()

**Ejemplo**

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colores = rng.rand(100)
tamanos = 1000 * rng.rand(100)
plt.scatter(x, y, c=colores, s=tamanos, alpha=0.3,cmap='viridis')
plt.colorbar() 
open_close_plot()

**Ejemplo**

from sklearn.datasets import load_iris
iris = load_iris()
print(iris)
{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       ...,
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]]), 'target': array([0, 0, 0, ..., 2, 2, 2]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': 'Iris Plants Database\n====================\n ... (descripción del conjunto de datos) ...', 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']}
(salida abreviada: 'data' y 'target' tienen 150 filas)
caracteristicas = iris.data.T
print(caracteristicas)
[[5.1 4.9 4.7 ... 6.5 6.2 5.9]
 [3.5 3.  3.2 ... 3.  3.4 3. ]
 [1.4 1.4 1.3 ... 5.2 5.4 5.1]
 [0.2 0.2 0.2 ... 2.  2.3 1.8]]
(salida abreviada: 4 filas × 150 columnas)
plt.scatter(caracteristicas[0], caracteristicas[1], 
            alpha=0.2,s=100*caracteristicas[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
open_close_plot()

Ploteando Histogramas y Densidad

plt.style.use('seaborn-white')
datos = np.random.randn(1000)
plt.hist(datos)
open_close_plot()

plt.hist(datos, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none')
open_close_plot()

Tests de normalidad

import scipy.stats

Para las pruebas de normalidad, las hipótesis siempre se plantean así.

Hipótesis:

H0: La muestra proviene de una distribución normal.

H1: La muestra no proviene de una distribución normal.

Nivel de significancia: se trabajará con un nivel de significancia de 0.05 (alpha = 0.05).

Criterio de decisión:

Si p < alpha, se rechaza H0.

Si p >= alpha, no se rechaza H0; es decir, los datos sí siguen la distribución normal.
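Este criterio de decisión puede encapsularse en una pequeña función auxiliar para no repetir el `if` en cada prueba. Es solo un bosquejo; el nombre `decidir_normalidad` es nuestro:

```python
def decidir_normalidad(p_value, alpha=0.05):
    """Aplica el criterio de decisión: se rechaza H0 si p < alpha."""
    if p_value < alpha:
        return 'No sigue la curva Normal (Se rechaza H0)'
    return 'Sí sigue la curva Normal (No se rechaza H0)'

# Ejemplos de uso
print(decidir_normalidad(0.60))    # p >= alpha
print(decidir_normalidad(0.001))   # p < alpha
```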

**Test de Shapiro-Wilk**

shapiro_resultados = scipy.stats.shapiro(datos)
print(shapiro_resultados)
(0.9985635280609131, 0.5966154336929321)
p_value = shapiro_resultados[1]
print(p_value)
# interpretación
0.5966154336929321
alpha = 0.05
if p_value > alpha:
    print('Sí sigue la curva Normal (No se rechaza H0)')
else:
    print('No sigue la curva Normal (Se rechaza H0)')
Sí sigue la curva Normal (No se rechaza H0)

**Otra forma gráfica:** Si los puntos se aproximan a la recta significa que los datos sí siguen la normal.

# QQ Plot
from statsmodels.graphics.gofplots import qqplot
# q-q plot
qqplot(datos, line='s')
open_close_plot()

**Test de Kolmogorov-Smirnov**

ks_resultados = scipy.stats.kstest(datos, cdf='norm')
print(ks_resultados)
KstestResult(statistic=0.025100443428763608, pvalue=0.5543510537951236)
p_value = ks_resultados[1]
print(p_value)
# interpretación
0.5543510537951236
alpha = 0.05
if p_value > alpha:
    print('Sí sigue la curva Normal (No se rechaza H0)')
else:
    print('No sigue la curva Normal (Se rechaza H0)')
Sí sigue la curva Normal (No se rechaza H0)

Ejemplo

datos = np.random.normal(0, 0.8, 1000)
plt.hist(datos)
plt.hist(datos, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none')
open_close_plot()

**Otra forma gráfica:** Si los puntos se aproximan a la recta significa que los datos sí siguen la normal.

# QQ Plot
from statsmodels.graphics.gofplots import qqplot
# q-q plot
qqplot(datos, line='s')
open_close_plot()

**Test de Shapiro-Wilk**

shapiro_resultados = scipy.stats.shapiro(datos)
print(shapiro_resultados)
(0.9991971254348755, 0.954029381275177)
p_value = shapiro_resultados[1]
print(p_value)
# interpretación
0.954029381275177
alpha = 0.05
if p_value > alpha:
    print('Sí sigue la curva Normal (No se rechaza H0)')
else:
    print('No sigue la curva Normal (Se rechaza H0)')
Sí sigue la curva Normal (No se rechaza H0)

**Test de Kolmogorov-Smirnov**

ks_resultados = scipy.stats.kstest(datos, cdf='norm')
print(ks_resultados)
KstestResult(statistic=0.06610914241339, pvalue=0.00030403097442555094)
p_value = ks_resultados[1]
print(p_value)
# interpretación
0.00030403097442555094
alpha = 0.05
if p_value > alpha:
    print('Sí sigue la curva Normal (No se rechaza H0)')
else:
    print('No sigue la curva Normal (Se rechaza H0)')
No sigue la curva Normal (Se rechaza H0)
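El resultado anterior no contradice al test de Shapiro-Wilk: `kstest` con `cdf='norm'` y sin parámetros adicionales compara contra la normal estándar N(0, 1), y aquí los datos tienen desviación 0.8. Un bosquejo de cómo estandarizar antes de aplicar la prueba (la semilla 42 es una elección nuestra para reproducibilidad):

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(42)
datos = rng.normal(0, 0.8, 1000)

# Sin estandarizar: kstest compara contra N(0, 1) y rechaza por la escala
sin_estandarizar = scipy.stats.kstest(datos, cdf='norm')

# Estandarizando primero, la comparación sí es justa
datos_estandar = (datos - datos.mean()) / datos.std(ddof=1)
estandarizado = scipy.stats.kstest(datos_estandar, cdf='norm')
print(sin_estandarizar.pvalue, estandarizado.pvalue)

# Alternativa: pasar la media y la desviación estimadas como parámetros
con_args = scipy.stats.kstest(datos, cdf='norm',
                              args=(datos.mean(), datos.std(ddof=1)))
```

Tenga en cuenta que estimar los parámetros con la misma muestra hace que el p-value nominal sea optimista; para ese caso existe la corrección de Lilliefors.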

Ejemplo

x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
estilo = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **estilo)
open_close_plot()

plt.hist(x2, **estilo)
open_close_plot()

plt.hist(x3, **estilo)
open_close_plot()

Personalizando leyendas, estilo Orientado a Objetos

**Ejemplo**

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Seno')
ax.plot(x, np.cos(x), '--r', label='Coseno')
ax.axis('equal')
leg = ax.legend()
plt.show()

**Ejemplo**

Se cambia un atributo del objeto ax y se muestra el gráfico de nuevo

ax.legend(frameon=False, loc='lower center', ncol=2)
plt.show()

**Ejemplo**

Se cambia un atributo del objeto ax y se muestra el gráfico de nuevo

ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
open_close_plot()

Varios gráficos a la vez

**Ejemplo**

ax1 = plt.axes()  # Ejes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
open_close_plot()

**Ejemplo**

fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
                   xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
                   ylim=(-1.2, 1.2))
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x))
open_close_plot()

**Ejemplo**

for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)),fontsize=18, ha='center')
open_close_plot()       

**Ejemplo**

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
# Los ejes son arreglos bidimensionales [i, j]
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)),
                      fontsize=18, ha='center')
open_close_plot() 

**Ejemplo Dígitos, los vamos a usar más adelante en el curso**

from sklearn.datasets import load_digits
digitos = load_digits(n_class=6)
print(digitos)
{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       ...,
       [ 0.,  0.,  6., ...,  6.,  0.,  0.]]), 'target': array([0, 1, 2, ..., 4, 4, 0]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'images': array([...]), 'DESCR': 'Optical Recognition of Handwritten Digits Data Set\n ... (descripción del conjunto de datos) ...'}
(salida abreviada)
fig, ax = plt.subplots(8, 8, figsize=(6, 6))
for i, axi in enumerate(ax.flat):
    axi.imshow(digitos.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])
open_close_plot()

Gráficos 3D

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # Necesario según la versión de matplotlib

**Ejemplo**

fig = plt.figure()
ax = Axes3D(fig)
open_close_plot()

**Ejemplo**

fig = plt.figure()
ax = Axes3D(fig)
# Datos para la línea en 3D
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
open_close_plot()

**Datos para los puntos**

fig = plt.figure()
ax = Axes3D(fig)
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')
open_close_plot()
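Además de líneas y puntos, mplot3d permite dibujar superficies con `plot_surface`. Un bosquejo; aquí se usa el backend 'Agg' (solo para ejecutarlo sin ventana) y `fig.add_subplot(projection='3d')`, una forma alternativa de crear los ejes 3D:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # backend sin ventana, solo para este ejemplo
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # alternativa a Axes3D(fig)

# Malla de puntos (x, y) y superficie z = sen(sqrt(x^2 + y^2))
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.close(fig)
```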

**El paquete Seaborn**

import seaborn as sns
iris = sns.load_dataset("iris")
print(iris.head())
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
corr = sns.pairplot(iris, hue='species', size=2.5)
open_close_plot()


Análisis Exploratorio Básico

Paso 1: Cargar la tabla de datos

import pandas as pd
import numpy as np
import prince
import os
os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
print(os.getcwd())
/Users/oldemarrodriguez/Google Drive/MDCurso/Datos
datos = pd.read_csv('SAheart.csv',delimiter=';',decimal=".")
print(datos.head())
   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age chd
0  160    12.00  5.73      23.11  Present     49    25.30    97.20   52  Si
1  144     0.01  4.41      28.61   Absent     55    28.87     2.06   63  Si
2  118     0.08  3.48      32.28  Present     52    29.14     3.81   46  No
3  170     7.50  6.41      38.03  Present     51    31.99    24.26   58  Si
4  134    13.60  3.50      27.78  Present     60    25.99    57.34   49  Si
print(datos.shape)
(462, 10)

Paso 2: Presentación de estadísticas básicas

**describe() es como el summary de R para las variables numéricas**

print(datos.dropna().describe())
              sbp     tobacco     ...         alcohol         age
count  462.000000  462.000000     ...      462.000000  462.000000
mean   138.326840    3.635649     ...       17.044394   42.816017
std     20.496317    4.593024     ...       24.481059   14.608956
min    101.000000    0.000000     ...        0.000000   15.000000
25%    124.000000    0.052500     ...        0.510000   31.000000
50%    134.000000    2.000000     ...        7.510000   45.000000
75%    148.000000    5.500000     ...       23.892500   55.000000
max    218.000000   31.200000     ...      147.190000   64.000000

[8 rows x 8 columns]
print(datos.describe())
              sbp     tobacco     ...         alcohol         age
count  462.000000  462.000000     ...      462.000000  462.000000
mean   138.326840    3.635649     ...       17.044394   42.816017
std     20.496317    4.593024     ...       24.481059   14.608956
min    101.000000    0.000000     ...        0.000000   15.000000
25%    124.000000    0.052500     ...        0.510000   31.000000
50%    134.000000    2.000000     ...        7.510000   45.000000
75%    148.000000    5.500000     ...       23.892500   55.000000
max    218.000000   31.200000     ...      147.190000   64.000000

[8 rows x 8 columns]
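Por defecto, `describe()` resume solo las variables numéricas; con `include='all'` también se resumen las categóricas (aparecen las filas `unique`, `top` y `freq`). Bosquejo con una mini tabla inventada, por si SAheart.csv no está disponible:

```python
import pandas as pd

# Mini tabla de ejemplo con una variable numérica y una categórica
df = pd.DataFrame({'edad': [52, 63, 46, 58],
                   'famhist': ['Present', 'Absent', 'Present', 'Present']})

resumen = df.describe(include='all')
print(resumen)
# Para 'famhist' se reportan unique (2), top ('Present') y freq (3)
```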
print(datos.mean(numeric_only=True))
sbp          138.326840
tobacco        3.635649
ldl            4.740325
adiposity     25.406732
typea         53.103896
obesity       26.044113
alcohol       17.044394
age           42.816017
dtype: float64
print(datos.median(numeric_only=True))
sbp          134.000
tobacco        2.000
ldl            4.340
adiposity     26.115
typea         53.000
obesity       25.805
alcohol        7.510
age           45.000
dtype: float64
print(datos.std(numeric_only=True))
sbp          20.496317
tobacco       4.593024
ldl           2.070909
adiposity     7.780699
typea         9.817534
obesity       4.213680
alcohol      24.481059
age          14.608956
dtype: float64
print(datos.max(numeric_only=True))
sbp          218.00
tobacco       31.20
ldl           15.33
adiposity     42.49
typea         78.00
obesity       46.58
alcohol      147.19
age           64.00
dtype: float64

**Los percentiles**

print(datos.quantile(np.array([0,.25,.50,.75,1])))
        sbp  tobacco      ldl  adiposity  typea  obesity   alcohol   age
0.00  101.0   0.0000   0.9800     6.7400   13.0  14.7000    0.0000  15.0
0.25  124.0   0.0525   3.2825    19.7750   47.0  22.9850    0.5100  31.0
0.50  134.0   2.0000   4.3400    26.1150   53.0  25.8050    7.5100  45.0
0.75  148.0   5.5000   5.7900    31.2275   60.0  28.4975   23.8925  55.0
1.00  218.0  31.2000  15.3300    42.4900   78.0  46.5800  147.1900  64.0

**Contando datos en las variables categóricas**

print(pd.crosstab(index=datos["chd"],columns="count"))
col_0  count
chd         
No       302
Si       160
print(pd.crosstab(index=datos["famhist"],columns="count"))
col_0    count
famhist       
Absent     270
Present    192

**Otra forma**

print(datos['chd'].value_counts())
No    302
Si    160
Name: chd, dtype: int64
print(datos["famhist"].value_counts())
Absent     270
Present    192
Name: famhist, dtype: int64
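`value_counts` también acepta `normalize=True` para obtener proporciones en lugar de conteos. Bosquejo con las mismas frecuencias de chd (302 'No' y 160 'Si'):

```python
import pandas as pd

# Serie construida con las mismas frecuencias que chd en SAheart
chd = pd.Series(['No'] * 302 + ['Si'] * 160)

proporciones = chd.value_counts(normalize=True)
print(proporciones)  # proporción de cada categoría; suman 1
```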

**Tabla cruzada**

famhist_chd = pd.crosstab(index=datos["famhist"], columns=datos["chd"])
print(famhist_chd)
chd       No  Si
famhist         
Absent   206  64
Present   96  96
famhist_chd.index = ["Absent","Present"]
print(famhist_chd)
chd       No  Si
Absent   206  64
Present   96  96

**Otra forma**

g_chd = pd.crosstab(index=datos["chd"],columns="count") 
print(g_chd) 
col_0  count
chd         
No       302
Si       160
print(g_chd['count'][0])
302
print(g_chd['count'][1])
160
g_famhist = pd.crosstab(index=datos["famhist"],columns="count") 
print(g_famhist)
col_0    count
famhist       
Absent     270
Present    192
print(g_famhist['count'][0])
270
print(g_famhist['count'][1])
192

Paso 3: Gráficos importantes

**Gráfico chd**

import matplotlib.pyplot as plt
alto = [g_chd['count'][0], g_chd['count'][1]]
barras = ('No', 'Sí')
y_pos = np.arange(len(barras))
plt.bar(y_pos, alto, color=['red','blue'])
plt.xticks(y_pos, barras)
open_close_plot()

**Gráfico famhist**

alto = [g_famhist['count'][0], g_famhist['count'][1]]
barras = ('Absent ', 'Present')
y_pos = np.arange(len(barras))
plt.bar(y_pos, alto, color=['red','blue'])
plt.xticks(y_pos, barras)
open_close_plot()

**Box Plots**

boxplots = datos.boxplot(return_type='axes')
open_close_plot()

**Función de densidad**

densidad = datos[datos.columns[:1]].plot(kind='density')
open_close_plot()

densidad = datos[datos.columns[8:9]].plot(kind='density')
open_close_plot()

densidad = datos['age'].plot(kind='density')
open_close_plot()

densidad = datos[datos.columns[:10]].plot(kind='density')
open_close_plot()

**Histogramas**

histograma = datos[datos.columns[:1]].plot(kind='hist')
open_close_plot()

histograma = datos[datos.columns[8:9]].plot(kind='hist')
open_close_plot()

histograma = datos['age'].plot(kind='hist')
open_close_plot()

histograma = datos[datos.columns[:10]].plot(kind='hist')
open_close_plot()

Gráfico de todas las variables 2 a 2

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(datos, hue='chd', size=2.5)
open_close_plot()

sns.pairplot(datos, hue='famhist', size=2.5)
open_close_plot()

Calculando correlaciones

**Nota:** Es “inteligente” e ignora las variables categóricas.

corr = datos.corr()
print(corr)
                sbp   tobacco       ldl    ...      obesity   alcohol       age
sbp        1.000000  0.212247  0.158296    ...     0.238067  0.140096  0.388771
tobacco    0.212247  1.000000  0.158905    ...     0.124529  0.200813  0.450330
ldl        0.158296  0.158905  1.000000    ...     0.330506 -0.033403  0.311799
adiposity  0.356500  0.286640  0.440432    ...     0.716556  0.100330  0.625954
typea     -0.057454 -0.014608  0.044048    ...     0.074006  0.039498 -0.102606
obesity    0.238067  0.124529  0.330506    ...     1.000000  0.051620  0.291777
alcohol    0.140096  0.200813 -0.033403    ...     0.051620  1.000000  0.101125
age        0.388771  0.450330  0.311799    ...     0.291777  0.101125  1.000000

[8 rows x 8 columns]
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
open_close_plot()
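A partir de la matriz de correlaciones puede extraerse, por ejemplo, el par de variables más correlacionado. Bosquejo con datos simulados (los nombres a, b y c son inventados):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.randn(100)
df = pd.DataFrame({'a': a,
                   'b': a + 0.1 * rng.randn(100),  # muy correlacionada con a
                   'c': rng.randn(100)})

corr = df.corr()
# Quedarse solo con el triángulo superior (sin la diagonal) para no duplicar pares
mascara = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pares = corr.where(mascara).stack().sort_values(ascending=False)
print(pares.head())
par_maximo = pares.index[0]  # el par con mayor correlación
```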

Recodificando variables

os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
#print(os.getcwd())
datos = pd.read_csv('SAheart.csv',delimiter=';',decimal=".")
print(datos.head())
   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age chd
0  160    12.00  5.73      23.11  Present     49    25.30    97.20   52  Si
1  144     0.01  4.41      28.61   Absent     55    28.87     2.06   63  Si
2  118     0.08  3.48      32.28  Present     52    29.14     3.81   46  No
3  170     7.50  6.41      38.03  Present     51    31.99    24.26   58  Si
4  134    13.60  3.50      27.78  Present     60    25.99    57.34   49  Si

**Convirtiendo una categoría en números**

print(pd.value_counts(datos["chd"]))
No    302
Si    160
Name: chd, dtype: int64

**Equivalente**

print(datos['chd'].value_counts())
No    302
Si    160
Name: chd, dtype: int64

**La siguiente función recodifica una categoría con números usando pandas**

**Note:** This does NOT convert the variable to numeric.

def recodificar(col, nuevo_codigo):
  # Returns a copy of the column with every key of nuevo_codigo
  # replaced by its corresponding value
  col_cod = pd.Series(col, copy=True)
  for llave, valor in nuevo_codigo.items():
    col_cod.replace(llave, valor, inplace=True)
  return col_cod
datos["chd"] = recodificar(datos["chd"], {'No':0,'Si':1})
print(datos.head())
   sbp  tobacco   ldl  adiposity ...  obesity  alcohol  age  chd
0  160    12.00  5.73      23.11 ...    25.30    97.20   52    1
1  144     0.01  4.41      28.61 ...    28.87     2.06   63    1
2  118     0.08  3.48      32.28 ...    29.14     3.81   46    0
3  170     7.50  6.41      38.03 ...    31.99    24.26   58    1
4  134    13.60  3.50      27.78 ...    25.99    57.34   49    1

[5 rows x 10 columns]

**After recoding**

print(pd.value_counts(datos["chd"]))
0    302
1    160
Name: chd, dtype: int64

**Equivalent**

print(datos['chd'].value_counts())
0    302
1    160
Name: chd, dtype: int64
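The same recoding can also be done with pandas built-ins, without a helper function. A minimal sketch (the short `chd` series here is hypothetical):

```python
import pandas as pd

chd = pd.Series(['No', 'Si', 'No', 'Si', 'No'])

# Explicit mapping, equivalent to recodificar(chd, {'No': 0, 'Si': 1})
como_numero = chd.map({'No': 0, 'Si': 1})
print(como_numero.tolist())  # [0, 1, 0, 1, 0]

# Or let pandas assign the codes from a categorical; the levels are
# taken in alphabetical order unless categories are given explicitly
codigos = chd.astype('category').cat.codes
print(codigos.tolist())      # [0, 1, 0, 1, 0]
```

Unlike `recodificar`, both variants produce a truly numeric (integer) column.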

**The other way around: converting a number into a category**

datos["chd"] = recodificar(datos["chd"], {0:'No',1:'Si'})
print(datos.head())
   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age chd
0  160    12.00  5.73      23.11  Present     49    25.30    97.20   52  Si
1  144     0.01  4.41      28.61   Absent     55    28.87     2.06   63  Si
2  118     0.08  3.48      32.28  Present     52    29.14     3.81   46  No
3  170     7.50  6.41      38.03  Present     51    31.99    24.26   58  Si
4  134    13.60  3.50      27.78  Present     60    25.99    57.34   49  Si

Principal Component Analysis - ACP

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Example 1

from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
(1797, 64)
pca = PCA(2)  # Reduce the dimensions to 2
componentes = pca.fit_transform(digits.data)
print(digits.data.shape)
(1797, 64)
print(componentes.shape)
(1797, 2)
plt.scatter(componentes[:, 0], componentes[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('viridis', 10))
plt.xlabel('componente 1')
plt.ylabel('componente 2')
plt.colorbar()
open_close_plot()
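How much information the two components retain can be read from the model's `explained_variance_ratio_`. Conceptually, PCA centers the data and projects it onto the top right singular vectors of the centered matrix. A numpy-only sketch of that projection, on synthetic data (for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the data, then take its SVD; the principal components are
# the projections onto the right singular vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
componentes = Xc @ Vt[:2].T   # matches PCA(2).fit_transform(X) up to axis signs
var_explicada = (S**2) / (S**2).sum()  # explained variance ratio per axis

print(componentes.shape)  # (100, 2)
print(var_explicada)
```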

Example 2

import os
import pandas as pd
os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
#print(os.getcwd())
datos = pd.read_csv('EjemploEstudiantes.csv',delimiter=';',decimal=",",index_col=0)
print(datos)
        Matematicas  Ciencias  Espanol  Historia  EdFisica
Lucia           7.0       6.5      9.2       8.6       8.0
Pedro           7.5       9.4      7.3       7.0       7.0
Ines            7.6       9.2      8.0       8.0       7.5
Luis            5.0       6.5      6.5       7.0       9.0
Andres          6.0       6.0      7.8       8.9       7.3
Ana             7.8       9.6      7.7       8.0       6.5
Carlos          6.3       6.4      8.2       9.0       7.2
Jose            7.9       9.7      7.5       8.0       6.0
Sonia           6.0       6.0      6.5       5.5       8.7
Maria           6.8       7.2      8.7       9.0       7.0
print(datos.head())
        Matematicas  Ciencias  Espanol  Historia  EdFisica
Lucia           7.0       6.5      9.2       8.6       8.0
Pedro           7.5       9.4      7.3       7.0       7.0
Ines            7.6       9.2      8.0       8.0       7.5
Luis            5.0       6.5      6.5       7.0       9.0
Andres          6.0       6.0      7.8       8.9       7.3
print(datos.shape)
(10, 5)
pca = PCA(n_components=2)
componentes = pca.fit_transform(datos)
print(componentes)
[[-0.76471745 -1.5817637 ]
 [ 1.66887794  1.39196556]
 [ 1.57822841  0.29949595]
 [-2.60701317  1.32020402]
 [-1.43877557 -1.33566867]
 [ 2.34790534  0.3880845 ]
 [-0.89372557 -1.51890124]
 [ 2.64984571  0.4254636 ]
 [-2.62959083  2.18339513]
 [ 0.08896518 -1.57227516]]
print(datos.shape)
(10, 5)
print(componentes.shape)
(10, 2)
plt.scatter(componentes[:, 0], componentes[:, 1])
plt.xlabel('componente 1')
plt.ylabel('componente 2')
open_close_plot()
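Note that sklearn's `PCA` centers the data but does not rescale it, so a variable measured on a larger scale can dominate the first axes. When the columns have different units it is common to standardize before fitting. A minimal sketch with pandas (the small frame is hypothetical):

```python
import pandas as pd

# Hypothetical grades on two subjects
datos = pd.DataFrame({'Matematicas': [7.0, 7.5, 5.0, 6.0],
                      'EdFisica':    [8.0, 7.0, 9.0, 7.3]})

# Standardize each column to mean 0 and standard deviation 1,
# then fit PCA on `centrados` instead of `datos`
centrados = (datos - datos.mean()) / datos.std()
print(centrados)
```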

ACP with the “prince” package

**On Mac (Terminal):**

pip install git+https://github.com/MaxHalford/Prince

**On Windows (Anaconda Prompt):**

pip install git+https://github.com/MaxHalford/Prince

**Prince on GitHub:** https://github.com/MaxHalford/prince

Installing packages

https://docs.python.org/3/installing/

import prince

The ACP class

import matplotlib.pyplot as plt
from prince import PCA
class ACP:
    def __init__(self, datos, n_componentes = 5): 
        self.__datos = datos
        self.__modelo = PCA(n_components = n_componentes).fit(self.__datos)
        self.__correlacion_var = self.__modelo.column_correlations(datos)
        self.__coordenadas_ind = self.__modelo.row_coordinates(datos)
        self.__contribucion_ind = self.__modelo.row_contributions(datos)
        self.__cos2_ind = self.__modelo.row_cosine_similarities(datos)
        self.__var_explicada = [x * 100 for x in self.__modelo.explained_inertia_]
    @property
    def datos(self):
        return self.__datos
    @datos.setter
    def datos(self, datos):
        self.__datos = datos
    @property
    def modelo(self):
        return self.__modelo
    @property
    def correlacion_var(self):
        return self.__correlacion_var
    @property
    def coordenadas_ind(self):
        return self.__coordenadas_ind
    @property
    def contribucion_ind(self):
        return self.__contribucion_ind
    @property
    def cos2_ind(self):
        return self.__cos2_ind
    @property
    def var_explicada(self):
        return self.__var_explicada
    def plot_plano_principal(self, ejes = [0, 1], ind_labels = True, titulo = 'Plano Principal'):
        x = self.coordenadas_ind[ejes[0]].values
        y = self.coordenadas_ind[ejes[1]].values
        plt.style.use('seaborn-whitegrid')
        plt.scatter(x, y, color = 'gray')
        plt.title(titulo)
        plt.axhline(y = 0, color = 'dimgrey', linestyle = '--')
        plt.axvline(x = 0, color = 'dimgrey', linestyle = '--')
        inercia_x = round(self.var_explicada[ejes[0]], 2)
        inercia_y = round(self.var_explicada[ejes[1]], 2)
        plt.xlabel('Componente ' + str(ejes[0]) + ' (' + str(inercia_x) + '%)')
        plt.ylabel('Componente ' + str(ejes[1]) + ' (' + str(inercia_y) + '%)')
        if ind_labels:
            for i, txt in enumerate(self.coordenadas_ind.index):
                plt.annotate(txt, (x[i], y[i]))
    def plot_circulo(self, ejes = [0, 1], var_labels = True, titulo = 'Círculo de Correlación'):
        cor = self.correlacion_var.iloc[:, ejes].values
        plt.style.use('seaborn-whitegrid')
        c = plt.Circle((0, 0), radius = 1, color = 'steelblue', fill = False)
        plt.gca().add_patch(c)
        plt.axis('scaled')
        plt.title(titulo)
        plt.axhline(y = 0, color = 'dimgrey', linestyle = '--')
        plt.axvline(x = 0, color = 'dimgrey', linestyle = '--')
        inercia_x = round(self.var_explicada[ejes[0]], 2)
        inercia_y = round(self.var_explicada[ejes[1]], 2)
        plt.xlabel('Componente ' + str(ejes[0]) + ' (' + str(inercia_x) + '%)')
        plt.ylabel('Componente ' + str(ejes[1]) + ' (' + str(inercia_y) + '%)')
        for i in range(cor.shape[0]):
            plt.arrow(0, 0, cor[i, 0] * 0.95, cor[i, 1] * 0.95, color = 'steelblue', 
                      alpha = 0.5, head_width = 0.05, head_length = 0.05)
            if var_labels:
                plt.text(cor[i, 0] * 1.05, cor[i, 1] * 1.05, self.correlacion_var.index[i], 
                         color = 'steelblue', ha = 'center', va = 'center')
    def plot_sobreposicion(self, ejes = [0, 1], ind_labels = True, 
                      var_labels = True, titulo = 'Sobreposición Plano-Círculo'):
        x = self.coordenadas_ind[ejes[0]].values
        y = self.coordenadas_ind[ejes[1]].values
        cor = self.correlacion_var.iloc[:, ejes]
        scale = min((max(x) - min(x)) / (max(cor[ejes[0]]) - min(cor[ejes[0]])), 
                    (max(y) - min(y)) / (max(cor[ejes[1]]) - min(cor[ejes[1]]))) * 0.7
        cor = self.correlacion_var.iloc[:, ejes].values
        plt.style.use('seaborn-whitegrid')
        plt.axhline(y = 0, color = 'dimgrey', linestyle = '--')
        plt.axvline(x = 0, color = 'dimgrey', linestyle = '--')
        inercia_x = round(self.var_explicada[ejes[0]], 2)
        inercia_y = round(self.var_explicada[ejes[1]], 2)
        plt.xlabel('Componente ' + str(ejes[0]) + ' (' + str(inercia_x) + '%)')
        plt.ylabel('Componente ' + str(ejes[1]) + ' (' + str(inercia_y) + '%)')
        plt.scatter(x, y, color = 'gray')
        if ind_labels:
            for i, txt in enumerate(self.coordenadas_ind.index):
                plt.annotate(txt, (x[i], y[i]))
        for i in range(cor.shape[0]):
            plt.arrow(0, 0, cor[i, 0] * scale, cor[i, 1] * scale, color = 'steelblue', 
                      alpha = 0.5, head_width = 0.05, head_length = 0.05)
            if var_labels:
                plt.text(cor[i, 0] * scale * 1.15, cor[i, 1] * scale * 1.15, 
                         self.correlacion_var.index[i], 
                         color = 'steelblue', ha = 'center', va = 'center')

Example 1

os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
datos = pd.read_csv('EjemploEstudiantes.csv',delimiter=';',decimal=",",index_col=0)
print(datos)
        Matematicas  Ciencias  Espanol  Historia  EdFisica
Lucia           7.0       6.5      9.2       8.6       8.0
Pedro           7.5       9.4      7.3       7.0       7.0
Ines            7.6       9.2      8.0       8.0       7.5
Luis            5.0       6.5      6.5       7.0       9.0
Andres          6.0       6.0      7.8       8.9       7.3
Ana             7.8       9.6      7.7       8.0       6.5
Carlos          6.3       6.4      8.2       9.0       7.2
Jose            7.9       9.7      7.5       8.0       6.0
Sonia           6.0       6.0      6.5       5.5       8.7
Maria           6.8       7.2      8.7       9.0       7.0
print(datos.shape)
(10, 5)
acp = ACP(datos,n_componentes=5)
acp.plot_plano_principal()
open_close_plot()

# Plot the correlation circle
acp.plot_circulo()
open_close_plot()

# Plot the plane-circle overlay
acp.plot_sobreposicion()
open_close_plot()

# Display the principal components
print(acp.coordenadas_ind)
               0         1         2         3         4
Lucia  -0.323063  1.772525  1.198801 -0.055015 -0.003633
Pedro  -0.665441 -1.638702  0.145476 -0.023065  0.123377
Ines   -1.002547 -0.515692  0.628888  0.516444 -0.142876
Luis    3.172095 -0.262782 -0.381960  0.677777  0.062504
Andres  0.488868  1.365402 -0.835236 -0.155792 -0.123367
Ana    -1.708633 -1.021700 -0.127077  0.066833 -0.025292
Carlos -0.067586  1.462336 -0.506240 -0.117928 -0.013124
Jose   -2.011855 -1.275865 -0.542150 -0.197787 -0.017434
Sonia   3.042030 -1.254881  0.448829 -0.639999 -0.037885
Maria  -0.923869  1.369359 -0.029330 -0.071467  0.177730
# Display the squared cosines of the individuals
print(acp.cos2_ind)
               0         1         2         3         4
Lucia   0.022271  0.670421  0.306660  0.000646  0.000003
Pedro   0.139906  0.848431  0.006687  0.000168  0.004809
Ines    0.514469  0.136123  0.202440  0.136520  0.010449
Luis    0.936852  0.006429  0.013584  0.042771  0.000364
Andres  0.084140  0.656354  0.245604  0.008545  0.005358
Ana     0.732686  0.261980  0.004053  0.001121  0.000161
Carlos  0.001893  0.886081  0.106192  0.005763  0.000071
Jose    0.673612  0.270910  0.048917  0.006510  0.000051
Sonia   0.808830  0.137637  0.017607  0.035800  0.000125
Maria   0.308554  0.677869  0.000311  0.001846  0.011419
# Display the correlations of the variables with the components
print(acp.correlacion_var)
                    0         1         2         3         4
Matematicas -0.895798 -0.345204  0.257979 -0.091468 -0.058828
Ciencias    -0.722798 -0.648395  0.023840  0.235878  0.030682
Espanol     -0.610893  0.717321  0.331025 -0.024542  0.045615
Historia    -0.599923  0.748470 -0.232063  0.156397 -0.039644
EdFisica     0.913926  0.119637  0.340651  0.183154 -0.028929

Plotting components 1 and 3 (numbered 0 and 2 in Python)

# Plot the principal plane
acp.plot_plano_principal(ejes = [0, 2])
open_close_plot()

# Plot the correlation circle
acp.plot_circulo(ejes = [0, 2])
open_close_plot()

# Plot the plane-circle overlay
acp.plot_sobreposicion(ejes = [0, 2])
open_close_plot()

Example 2

os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
iris = pd.read_csv('iris.csv',delimiter=';',decimal=".")
print(iris.head())
   s.largo  s.ancho  p.largo  p.ancho    tipo
0      5.1      3.5      1.4      0.2  setosa
1      4.9      3.0      1.4      0.2  setosa
2      4.7      3.2      1.3      0.2  setosa
3      4.6      3.1      1.5      0.2  setosa
4      5.0      3.6      1.4      0.2  setosa
print(iris.shape)
(150, 5)
iris2 = pd.DataFrame(data=iris, columns=['s.largo', 's.ancho', 'p.largo', 'p.ancho'])
acp = ACP(iris2,n_componentes=4)
# Plot the principal plane
acp.plot_plano_principal()
open_close_plot()
# Display the principal components
# print(acp.modelo.row_coordinates(iris2))
# Display the squared cosines
# print(acp.modelo.row_cosine_similarities(iris2))
# Display the correlations of the variables with the components
# print(acp.modelo.column_correlations(iris2))
# Eigenvalues
# print(acp.modelo.eigenvalues_)

# Plot the correlation circle
acp.plot_circulo()
open_close_plot()

# Plot the plane-circle overlay
acp.plot_sobreposicion()
open_close_plot()

Principal Component Analysis with categorical variables

Dummy codes (full disjunctive coding) - pd.get_dummies(Datos)
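A minimal sketch of what `pd.get_dummies` produces, on a hypothetical two-column frame. Numeric columns pass through unchanged and each category level becomes its own 0/1 column; note that the full disjunctive code is linearly dependent (the levels of one variable always sum to 1), which `drop_first=True` avoids:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({'Genero': ['F', 'M', 'F'], 'Nota': [7.0, 6.5, 8.0]})

# Numeric columns pass through; 'Genero' expands to Genero_F, Genero_M
dummies = pd.get_dummies(df)
print(list(dummies.columns))  # ['Nota', 'Genero_F', 'Genero_M']

# drop_first=True removes one level per variable, breaking the exact
# linear dependence Genero_F + Genero_M == 1
print(list(pd.get_dummies(df, drop_first=True).columns))  # ['Nota', 'Genero_M']
```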

Example 3

os.chdir("/Users/oldemarrodriguez/Google Drive/MDCurso/Datos")
datos = pd.read_csv('EjemploEstudiantes_Categoricas.csv',delimiter=';',decimal=",",index_col=0)
print(datos.head())
        Matematicas  Ciencias  Espanol   ...     EdFisica  Genero Conducta
Lucia           7.0       6.5      9.2   ...          8.0       F        3
Pedro           7.5       9.4      7.3   ...          7.0       M        2
Ines            7.6       9.2      8.0   ...          7.5       F        2
Luis            5.0       6.5      6.5   ...          9.0       M        1
Andres          6.0       6.0      7.8   ...          7.3       M        2

[5 rows x 7 columns]
print(datos.shape)
(10, 7)
print(datos.dtypes)
Matematicas    float64
Ciencias       float64
Espanol        float64
Historia       float64
EdFisica       float64
Genero          object
Conducta         int64
dtype: object

Recoding the “Conducta” variable using text, then converting it into dummy variables

datos["Conducta"] = recodificar(datos["Conducta"], {1:'Mala',2:'Regular',3:'Buena'})
print(datos.head())
        Matematicas  Ciencias  Espanol   ...     EdFisica  Genero Conducta
Lucia           7.0       6.5      9.2   ...          8.0       F    Buena
Pedro           7.5       9.4      7.3   ...          7.0       M  Regular
Ines            7.6       9.2      8.0   ...          7.5       F  Regular
Luis            5.0       6.5      6.5   ...          9.0       M     Mala
Andres          6.0       6.0      7.8   ...          7.3       M  Regular

[5 rows x 7 columns]
print(datos.dtypes)
Matematicas    float64
Ciencias       float64
Espanol        float64
Historia       float64
EdFisica       float64
Genero          object
Conducta        object
dtype: object

# Converting the variables into dummies
datos_dummy = pd.get_dummies(datos)
print(datos_dummy.head())
        Matematicas  Ciencias        ...         Conducta_Mala  Conducta_Regular
Lucia           7.0       6.5        ...                     0                 0
Pedro           7.5       9.4        ...                     0                 1
Ines            7.6       9.2        ...                     0                 1
Luis            5.0       6.5        ...                     1                 0
Andres          6.0       6.0        ...                     0                 1

[5 rows x 10 columns]
print(datos_dummy.dtypes)
Matematicas         float64
Ciencias            float64
Espanol             float64
Historia            float64
EdFisica            float64
Genero_F              uint8
Genero_M              uint8
Conducta_Buena        uint8
Conducta_Mala         uint8
Conducta_Regular      uint8
dtype: object

ACP with dummy variables

acp = ACP(datos_dummy,n_componentes=5)
acp.plot_plano_principal()
open_close_plot()

# Plot the correlation circle
acp.plot_circulo()
open_close_plot()

# Plot the plane-circle overlay
acp.plot_sobreposicion()
open_close_plot()

END